Business Problem

Large production houses spend millions of dollars upfront to produce movies and only see the relative success or failure many months later. This means that unlike many other businesses, they are not able to pilot or test their product with smaller budgets and so have higher risk. Given this scenario, they need to be able optimize their initial spends by choosing the right combination of factors (genre, actors, director, timing of release etc) or, if given a certain combination of factors (e.g. genre, actors and director), need to be able to make decisions on budget and timing to give the movie the largest chance of success.

Business Solution Process

We use a database of 5000 movies on the Kaggle to develop predictive models for movie production houses. By using decision tree and random forest, we will develop models that can address the following questions:

Packages Required

To begin, we have used the following R packages installed:

#Loading the required packages
library(readr)
library(jsonlite)
library(tidyr)
library(dplyr)
library(DT)
library(knitr)
library(ggplot2) # visualization
library(ggrepel)
library(ggthemes) # visualization
library(scales) # visualization
library(dplyr) # data manipulation
library(VIM)
library(data.table)
library(formattable)
library(plotly)
library(corrplot)
library(GGally)
library(caret)
library(car)

```

Descriptive Statistics

Data Source

The original dataset is obtained from Kaggle. We have downloaded the dataset from this source and hosted in our group GitHub profile.

Data Description

The TMDB dataset was generated from The Movie Database API. It has a set of total 20 columns which are factors that potentially affect the success of the movies. These factors could be potentially grouped into the following categories:

  • Financial includes columns such as “budget” and “revenue”
  • Perception includes columns such as “votes”,“average rating” etc.
  • production includes columns such as “production countries”, “released date”, “genre”, “production house”,“languages” etc.

Data Importing

Fetching the Data from GitHub

We perform the data importing from the github profile where we have hosted the data. The url for the data is set to the variable url and the data is read into the object df

#URL to read the data from
url <- "https://raw.githubusercontent.com/insead-dsb-g08/predicting-movie-success/master/movie_metadata.csv"
#Reading the csv file from the URL
IMDB <- read_csv(url,col_names = TRUE,na = "NA")
#Preview of the data dimensions and column names
dim(IMDB)
## [1] 5043   28
#Examining the column names in the dataset
colnames(IMDB)
##  [1] "color"                     "director_name"            
##  [3] "num_critic_for_reviews"    "duration"                 
##  [5] "director_facebook_likes"   "actor_3_facebook_likes"   
##  [7] "actor_2_name"              "actor_1_facebook_likes"   
##  [9] "gross"                     "genres"                   
## [11] "actor_1_name"              "movie_title"              
## [13] "num_voted_users"           "cast_total_facebook_likes"
## [15] "actor_3_name"              "facenumber_in_poster"     
## [17] "plot_keywords"             "movie_imdb_link"          
## [19] "num_user_for_reviews"      "language"                 
## [21] "country"                   "content_rating"           
## [23] "budget"                    "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"

We have 5043 observations of 28 variables. The response variable “imdb_score” is numerical, and the predictors are mixed with numerical and categorical variables.

2.2 Remove Duplicates

In the IMDB data, we have some duplicate rows. We want to remove the 45 duplicated rows and keep the unique ones.

# duplicate rows
sum(duplicated(IMDB))
## [1] 45
# delete duplicate rows
IMDB <- IMDB[!duplicated(IMDB), ]

We get 4998 observations left.

2.3 Tidy Up Movie Title

All the movie titles have a special character (Â) at the end and some have whitespaces, they might be generated during the data collection. Let’s remove them.

library(stringr)
## Warning: package 'stringr' was built under R version 3.5.3
IMDB$movie_title <- gsub("Â", "", as.character(factor(IMDB$movie_title)))
str_trim(IMDB$movie_title, side = "right")

2.4 Split Genres

Each record of genres is combined with a few types, which will cause the difficulty of analyzing.

head(IMDB$genres)
## [1] "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy"       
## [3] "Action|Adventure|Thriller"       "Action|Thriller"                
## [5] "Documentary"                     "Action|Adventure|Sci-Fi"

First, we want to know if genre is related to imdb score. We divide the string into several substrings by the separator ‘|’, and save each substring along with its correspongding imdb score in the other data frame genres.df. Then we plot a histogram for the score and genres to see if they are relative or not.

# create a new data frame
genres.df <- as.data.frame(IMDB[,c("genres", "imdb_score")])
# separate different genres into new columns
genres.df$Action <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Action") 1 else 0)
genres.df$Adventure <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Adventure") 1 else 0)
genres.df$Animation <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Animation") 1 else 0)
genres.df$Biography <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Biography") 1 else 0)
genres.df$Comedy <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Comedy") 1 else 0)
genres.df$Crime <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Crime") 1 else 0)
genres.df$Documentary <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Documentary") 1 else 0)
genres.df$Drama <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Drama") 1 else 0)
genres.df$Family <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Family") 1 else 0)
genres.df$Fantasy <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Fantasy") 1 else 0)
genres.df$`Film-Noir` <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Film-Noir") 1 else 0)
genres.df$History <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "History") 1 else 0)
genres.df$Horror <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Horror") 1 else 0)
genres.df$Musical <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Musical") 1 else 0)
genres.df$Mystery <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Mystery") 1 else 0)
genres.df$News <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "News") 1 else 0)
genres.df$Romance <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Romance") 1 else 0)
genres.df$`Sci-Fi` <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Sci-Fi") 1 else 0)
genres.df$Short <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Short") 1 else 0)
genres.df$Sport <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Sport") 1 else 0)
genres.df$Thriller <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Thriller") 1 else 0)
genres.df$War <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "War") 1 else 0)
genres.df$Western <- sapply(1:length(genres.df$genres), function(x) if (genres.df[x,1] %like% "Western") 1 else 0)
# get the mean of imdb score for different genres
means <- rep(0,23)
for (i in 1:23) {
  means[i] <- mean(genres.df$imdb_score[genres.df[i+2]==1])
}
# plot the means
barplot(means, main = "Average imdb scores for different genres")

There isn’t much difference in the averages of imdb score related to different genres, almost all the averages are in the same range of 6~8. So we think the predictor “genres” can be removed because it’s not really related to the score.

IMDB <- subset(IMDB, select = -c(genres))

3 Data Cleaning

3.1 Missing Values

To find missing values in each column, we use colSums() function to aggregate NA in each column.

colSums(sapply(IMDB, is.na))
##                     color             director_name 
##                         0                         0 
##    num_critic_for_reviews                  duration 
##                        49                        15 
##   director_facebook_likes    actor_3_facebook_likes 
##                       103                        23 
##              actor_2_name    actor_1_facebook_likes 
##                         0                         7 
##                     gross              actor_1_name 
##                       874                         0 
##               movie_title           num_voted_users 
##                         0                         0 
## cast_total_facebook_likes              actor_3_name 
##                         0                         0 
##      facenumber_in_poster             plot_keywords 
##                        13                         0 
##           movie_imdb_link      num_user_for_reviews 
##                         0                        21 
##                  language                   country 
##                         0                         0 
##            content_rating                    budget 
##                         0                       487 
##                title_year    actor_2_facebook_likes 
##                       107                        13 
##                imdb_score              aspect_ratio 
##                         0                       327 
##      movie_facebook_likes 
##                         0

Let’s use heatmap to visualize missing values.

missing.values <- aggr(IMDB, sortVars = T, prop = T, sortCombs = T, cex.lab = 1.5, cex.axis = .6, cex.numbers = 5, combined = F, gap = -.2)

## 
##  Variables sorted by number of missings: 
##                   Variable       Count
##                      gross 0.174869948
##                     budget 0.097438976
##               aspect_ratio 0.065426170
##                 title_year 0.021408563
##    director_facebook_likes 0.020608243
##     num_critic_for_reviews 0.009803922
##     actor_3_facebook_likes 0.004601841
##       num_user_for_reviews 0.004201681
##                   duration 0.003001200
##       facenumber_in_poster 0.002601040
##     actor_2_facebook_likes 0.002601040
##     actor_1_facebook_likes 0.001400560
##                      color 0.000000000
##              director_name 0.000000000
##               actor_2_name 0.000000000
##               actor_1_name 0.000000000
##                movie_title 0.000000000
##            num_voted_users 0.000000000
##  cast_total_facebook_likes 0.000000000
##               actor_3_name 0.000000000
##              plot_keywords 0.000000000
##            movie_imdb_link 0.000000000
##                   language 0.000000000
##                    country 0.000000000
##             content_rating 0.000000000
##                 imdb_score 0.000000000
##       movie_facebook_likes 0.000000000

3.1.1 Delete some rows

Since gross and budget have too many missing values, and we want to keep these two variables for the following analysis, we can only delete rows with null values for gross and budget because imputation will not do a good job here.

IMDB <- IMDB[!is.na(IMDB$gross), ]
IMDB <- IMDB[!is.na(IMDB$budget), ]
dim(IMDB)
## [1] 3857   27

Not too bad, we only omitted 23% of the observations. Now our data has 3857 observations.

Let’s see how many complete cases we have.

sum(complete.cases(IMDB))
## [1] 3768

So, there are still 3857 - 3768 = 89 rows with NAs.

3.1.2 Analyze aspect ratio

let’s take a look at rest columns with missing values.

colSums(sapply(IMDB, is.na))
##                     color             director_name 
##                         0                         0 
##    num_critic_for_reviews                  duration 
##                         1                         1 
##   director_facebook_likes    actor_3_facebook_likes 
##                         0                        10 
##              actor_2_name    actor_1_facebook_likes 
##                         0                         3 
##                     gross              actor_1_name 
##                         0                         0 
##               movie_title           num_voted_users 
##                         0                         0 
## cast_total_facebook_likes              actor_3_name 
##                         0                         0 
##      facenumber_in_poster             plot_keywords 
##                         6                         0 
##           movie_imdb_link      num_user_for_reviews 
##                         0                         0 
##                  language                   country 
##                         0                         0 
##            content_rating                    budget 
##                         0                         0 
##                title_year    actor_2_facebook_likes 
##                         0                         5 
##                imdb_score              aspect_ratio 
##                         0                        74 
##      movie_facebook_likes 
##                         0

Now aspect_ratio has the highest number of missing values. Before trying to impute the missing values, we want to check how important is this variable.

table(IMDB$aspect_ratio)
## 
## 1.18 1.33 1.37  1.5 1.66 1.75 1.77 1.78 1.85    2  2.2 2.24 2.35 2.39  2.4 
##    1   19   50    1   40    2    1   41 1600    3   10    1 1995   11    3 
## 2.55 2.76   16 
##    1    3    1

The most common aspect ratios are 1.85 and 2.35. For analyzing purpose, we group other ratios together.

In order to compute the mean of imdb score for different aspect_ratio, we need to replace NA with 0 first.

IMDB$aspect_ratio[is.na(IMDB$aspect_ratio)] <- 0
mean(IMDB$imdb_score[IMDB$aspect_ratio == 1.85])
## [1] 6.373938
mean(IMDB$imdb_score[IMDB$aspect_ratio == 2.35])
## [1] 6.508471
mean(IMDB$imdb_score[IMDB$aspect_ratio != 1.85 & IMDB$aspect_ratio != 2.35])
## [1] 6.672519

From the means of imdb score for different aspect ratios, we can see there is no significant difference, all the means fall in the range of 6.3~6.8. So, removing this variable won’t affect our following analysis.

IMDB <- subset(IMDB, select = -c(aspect_ratio))

3.1.3 Deal with 0s

We notice that there are some 0 values which should also be regarded as missing value except for predictor facenumber_in_poster.

First we need to replace NA with column average for facenumber_in_poster, then replace 0s in other predictors with NA, and lastly replace all NAs with their respective column mean.

# replace NA with column average for facenumber_in_poster
IMDB$facenumber_in_poster[is.na(IMDB$facenumber_in_poster)] <- round(mean(IMDB$facenumber_in_poster, na.rm = TRUE))
# convert 0s into NAs for other predictors
IMDB[,c(5,6,8,13,24,26)][IMDB[,c(5,6,8,13,24,26)] == 0] <- NA
# impute missing value with column mean
IMDB$num_critic_for_reviews[is.na(IMDB$num_critic_for_reviews)] <- round(mean(IMDB$num_critic_for_reviews, na.rm = TRUE))
IMDB$duration[is.na(IMDB$duration)] <- round(mean(IMDB$duration, na.rm = TRUE))
IMDB$director_facebook_likes[is.na(IMDB$director_facebook_likes)] <- round(mean(IMDB$director_facebook_likes, na.rm = TRUE))
IMDB$actor_3_facebook_likes[is.na(IMDB$actor_3_facebook_likes)] <- round(mean(IMDB$actor_3_facebook_likes, na.rm = TRUE))
IMDB$actor_1_facebook_likes[is.na(IMDB$actor_1_facebook_likes)] <- round(mean(IMDB$actor_1_facebook_likes, na.rm = TRUE))
IMDB$cast_total_facebook_likes[is.na(IMDB$cast_total_facebook_likes)] <- round(mean(IMDB$cast_total_facebook_likes, na.rm = TRUE))
IMDB$actor_2_facebook_likes[is.na(IMDB$actor_2_facebook_likes)] <- round(mean(IMDB$actor_2_facebook_likes, na.rm = TRUE))
IMDB$movie_facebook_likes[is.na(IMDB$movie_facebook_likes)] <- round(mean(IMDB$movie_facebook_likes, na.rm = TRUE))

Now we finished imputing the numeric missing values. There are still some categorical missing values, let’s take a look.

3.1.4 Sort out content ratings

We find there are still some missing values in content_rating, which are marked as “”.

table(IMDB$content_rating)
## 
##            Approved         G        GP         M     NC-17 Not Rated 
##        51        17        91         1         2         6        42 
##    Passed        PG     PG-13         R   Unrated         X 
##         3       573      1314      1723        24        10

Blanks should be taken as missing value. Since these missing values cannot be replaced with reasonable data, we delete these rows.

IMDB <- IMDB[!(IMDB$content_rating %in% ""),]

According to the history of naming these different content ratings, we find M = GP = PG, X = NC-17. We want to replace M and GP with PG, replace X with NC-17, because these two are what we use nowadays.

IMDB$content_rating[IMDB$content_rating == 'M']   <- 'PG' 
IMDB$content_rating[IMDB$content_rating == 'GP']  <- 'PG' 
IMDB$content_rating[IMDB$content_rating == 'X']   <- 'NC-17'

We want to replace “Approved”, “Not Rated”, “Passed”, “Unrated” with the most common rating “R”.

IMDB$content_rating[IMDB$content_rating == 'Approved']  <- 'R' 
IMDB$content_rating[IMDB$content_rating == 'Not Rated'] <- 'R' 
IMDB$content_rating[IMDB$content_rating == 'Passed']    <- 'R' 
IMDB$content_rating[IMDB$content_rating == 'Unrated']   <- 'R' 
IMDB$content_rating <- factor(IMDB$content_rating)
table(IMDB$content_rating)
## 
##     G NC-17    PG PG-13     R 
##    91    16   576  1314  1809

Now we only have 5 different content ratings.

3.2 Add Columns

We have gross and budget information. So let’s add two colums: profit and percentage return on investment for further analysis.

IMDB <- IMDB %>% 
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit/budget)*100)

3.3 Remove Columns

3.3.1 Is the color of a movie influential?

table(IMDB$color)
## 
##                 Black and White           Color 
##               2             124            3680

More than 96% movies are colored, which indicates that this predictor is nearly constant. Let’s remove this predictor.

# delete predictor color
IMDB <- subset(IMDB, select = -c(color))

3.3.2 Is language an important factor for imdb score? What about country?

table(IMDB$language)
## 
##            Aboriginal     Arabic    Aramaic    Bosnian  Cantonese 
##          2          2          1          1          1          7 
##      Czech     Danish       Dari      Dutch    English   Filipino 
##          1          3          2          3       3644          1 
##     French     German     Hebrew      Hindi  Hungarian Indonesian 
##         34         11          2          5          1          2 
##    Italian   Japanese     Kazakh     Korean   Mandarin       Maya 
##          7         10          1          5         14          1 
##  Mongolian       None  Norwegian    Persian Portuguese   Romanian 
##          1          1          4          3          5          1 
##    Russian    Spanish       Thai Vietnamese       Zulu 
##          1         24          3          1          1

Over 95% movies are in English, which means this variable is nearly constant. Let’s remove it.

IMDB <- subset(IMDB, select = -c(language))

Let’s take a look at predictor country.

table(IMDB$country)
## 
##    Afghanistan      Argentina          Aruba      Australia        Belgium 
##              1              3              1             40              1 
##         Brazil         Canada          Chile          China       Colombia 
##              5             63              1             13              1 
## Czech Republic        Denmark        Finland         France        Georgia 
##              3              9              1            103              1 
##        Germany         Greece      Hong Kong        Hungary        Iceland 
##             79              1             13              2              1 
##          India      Indonesia           Iran        Ireland         Israel 
##              5              1              4              7              2 
##          Italy          Japan         Mexico    Netherlands       New Line 
##             11             15             10              3              1 
##    New Zealand         Norway  Official site           Peru    Philippines 
##             11              4              1              1              1 
##         Poland        Romania         Russia   South Africa    South Korea 
##              1              2              3              3              8 
##          Spain         Taiwan       Thailand             UK            USA 
##             22              2              4            316           3025 
##   West Germany 
##              1

Around 79% movies are from USA, 8% from UK, 13% from other countries. So we group other countries together to make this categorical variable with less levels: USA, UK, Others.

levels(IMDB$country) <- c(levels(IMDB$country), "Others")
IMDB$country[(IMDB$country != 'USA')&(IMDB$country != 'UK')] <- 'Others' 
IMDB$country <- factor(IMDB$country)
table(IMDB$country)
## 
## Others     UK    USA 
##    465    316   3025

4 Data Visualization

4.1 Histogram of Movie Released

Movie production just exploded after year 1990. It could be due to advancement in technology and commercialisation of internet.

ggplot(IMDB, aes(title_year)) +
  geom_bar() +
  labs(x = "Year movie was released", y = "Movie Count", title = "Histogram of Movie released") +
  theme(plot.title = element_text(hjust = 0.5))

From the graph, we see there aren’t many records of movies released before 1980. It’s better to remove those records because they might not be representative.

IMDB <- IMDB[IMDB$title_year >= 1980,]

4.2 Top 20 movies based on its Profit

IMDB %>%
  filter(title_year %in% c(2000:2016)) %>%
  arrange(desc(profit)) %>%
  top_n(20, profit) %>%
  ggplot(aes(x=budget/1000000, y=profit/1000000)) +
  geom_point() +
  geom_smooth() + 
  geom_text_repel(aes(label=movie_title)) +
  labs(x = "Budget $million", y = "Profit $million", title = "Top 10 Profitable Movies") +
  theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

These are the top 20 movies based on the Profit earned (Gross - Budget). It can be inferred from this plot that high budget movies tend to earn more profit. The trend is almost linear, with profit increasing with the increase in budget.

4.3 Top 20 movies based on its Return on Investment

IMDB %>%
  filter(budget > 100000) %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit/budget)*100) %>%
  arrange(desc(profit)) %>%
  top_n(20, profit) %>%
  ggplot(aes(x=budget/1000000, y = return_on_investment_perc)) + 
  geom_point(size = 2) + 
  geom_smooth(size = 1) + 
  geom_text_repel(aes(label = movie_title), size = 3) + 
  xlab("Budget $million") + 
  ylab("Percent Return on Investment") + 
  ggtitle("20 Most Profitable Movies based on its Return on Investment")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

These are the top 20 movies based on its Percentage Return on Investment ((profit/budget)*100).

Since profit earned by a movie does not give a clear picture about its monetary success over the years, this analysis, over the absolute value of the Return on Investment(ROI) across its Budget, would provide better results.

As hypothesized, the ROI is high for Low Budget Films and decreases as the budget of the movie increases.

4.4 Top 20 directors with highest average IMDB score

IMDB %>%
  group_by(director_name) %>%
  summarise(avg_imdb = mean(imdb_score)) %>%
  arrange(desc(avg_imdb)) %>%
  top_n(20, avg_imdb) %>%
  formattable(list(avg_imdb = color_bar("orange")), align = 'l')
director_name avg_imdb
Tony Kaye 8.600000
Damien Chazelle 8.500000
Majid Majidi 8.500000
Ron Fricke 8.500000
Christopher Nolan 8.425000
Asghar Farhadi 8.400000
Marius A. Markevicius 8.400000
Richard Marquand 8.400000
Sergio Leone 8.400000
Lee Unkrich 8.300000
Lenny Abrahamson 8.300000
Pete Docter 8.233333
Hayao Miyazaki 8.225000
Joshua Oppenheimer 8.200000
Juan José Campanella 8.200000
Quentin Tarantino 8.200000
David Sington 8.100000
Je-kyu Kang 8.100000
Terry George 8.100000
Tim Miller 8.100000

4.5 Commercial Success v.s. Critical Acclaim

IMDB %>%
  top_n(20, profit) %>%
  ggplot(aes(x = imdb_score, y = gross/10^6, size = profit/10^6, color = content_rating)) + 
  geom_point() + 
  geom_hline(aes(yintercept = 600)) + 
  geom_vline(aes(xintercept = 7.75)) + 
  geom_text_repel(aes(label = movie_title), size = 4) +
  xlab("Imdb score") + 
  ylab("Gross money earned in million dollars") + 
  ggtitle("Commercial success Vs Critical acclaim") +
  annotate("text", x = 8.5, y = 700, label = "High ratings \n & High gross") +
  theme(plot.title = element_text(hjust = 0.5))

This is an analysis on the Commercial Success acclaimed by the movie (Gross earnings and profit earned) v.s. its IMDB Score.

As expected, there is not much correlation since most critically acclaimed movies do not do much well commercially.

4.6 Relation between number of facebook likes and imdb_score

IMDB %>%
  plot_ly(x = ~movie_facebook_likes, y = ~imdb_score, color = ~content_rating , mode = "markers", text = ~content_rating, alpha = 0.7, type = "scatter")

We divide this scatter plot by content-rating. Movie with extremely high Facebook likes tend to have higher imdb score. But the score for movie with low Facebook likes vary in a very wide range.

5 Data Pre-processing

5.1 Remove Names

We have 1660 directors, and 3621 actors in this data.

# number of directors
sum(uniqueN(IMDB$director_name))
## [1] 1660
# number of actors
sum(uniqueN(IMDB[, c("actor_1_name", "actor_2_name", "actor_3_name")]))
## [1] 3621

Since all the names are so different for the whole dataset, there is no point to use names to predict score.

Same with plot keywords, they are too diverse to be used in the prediction.

And movie link is also a redundant variable.

IMDB <- subset(IMDB, select = -c(director_name, actor_2_name, actor_1_name,
                                 movie_title, actor_3_name, plot_keywords, 
                                 movie_imdb_link))

5.2 Remove Linear Dependent Variables

For the purpose of data exploration, we added two variables based on existing variables: profit and return_on_investment_perc. In order to avoid multicollinearity, here we remove these two added variables.

IMDB <- subset(IMDB, select = -c(profit, return_on_investment_perc))

5.3 Remove Highly Correlated Variables

First we plot the correlation heatmap for our data.

ggcorr(IMDB, label = TRUE, label_round = 2, label_size = 3.5, size = 2, hjust = .85) +
  ggtitle("Correlation Heatmap") +
  theme(plot.title = element_text(hjust = 0.5))

Based on the heatmap, we can see some high correlations (greater than 0.7) between predictors.

According to the highest correlation value 0.95, we find actor_1_facebook_likes is highly correlated with the cast_total_facebook_likes, and both actor2 and actor3 are also somehow correlated to the total. So we want to modify them into two variables: actor_1_facebook_likes and other_actors_facebook_likes.

There are high correlations among num_voted_users, num_user_for_reviews and num_critic_for_reviews. We want to keep num_voted_users and take the ratio of num_user_for_reviews and num_critic_for_reviews.

# add up actor 2 and 3 facebook likes into other actors facebook likes
IMDB$other_actors_facebook_likes <- IMDB$actor_2_facebook_likes + IMDB$actor_3_facebook_likes
# use the ratio of critical reviews amount to total reviews amount
IMDB$critic_review_ratio <- IMDB$num_critic_for_reviews / IMDB$num_user_for_reviews
# delete columns
IMDB <- subset(IMDB, select = -c(cast_total_facebook_likes, actor_2_facebook_likes, actor_3_facebook_likes, num_critic_for_reviews, num_user_for_reviews))

Now the correlation heatmap becomes like this.

ggcorr(IMDB, label = TRUE, label_round = 2, label_size = 4, size = 3, hjust = .85) +
  ggtitle("Correlation Heatmap") +
  theme(plot.title = element_text(hjust = 0.5))

We don’t see any strong correlation (absolute value greater than 0.7) any more.

5.4 Bin Response Variable

Our goal is to build a model, which can help us predict if a movie is good or bad. So we don’t really want an exact score to be predicted, we only want to know how good or how bad is the movie. Therefore, we bin the score into 4 buckets: less than 4, 4~6, 6~8 and 8~10, which represents bad, OK, good and excellent respectively.

IMDB$binned_score <- cut(IMDB$imdb_score, breaks = c(0,4,6,8,10))

5.5 Organize the dataset

We want to reorder the columns to make the dataset easier to be understood. And we also renamed the columns to make the names shorter.

IMDB <- IMDB[,c(9,4,5,14,12,2,3,13,1,6,10,7,8,11,15)]
colnames(IMDB) <- c("budget", "gross", "user_vote", "critic_review_ratio",
                    "movie_fb", "director_fb", "actor1_fb", "other_actors_fb",
                    "duration", "face_number", "year", "country", "content",
                    "imdb_score", "binned_score")

5.6 Split Data

Here we split data into training, validation and test sets with the ratio of 6:2:2.

set.seed(45)
train.index <- sample(row.names(IMDB), dim(IMDB)[1]*0.6)
valid.index <- sample(setdiff(row.names(IMDB), train.index), dim(IMDB)[1]*0.2)
test.index <- setdiff(row.names(IMDB), union(train.index, valid.index))
train <- IMDB[train.index, ]
valid <- IMDB[valid.index, ]
test <- IMDB[test.index, ]

6 Implement Algorithm

6.1 Classification Tree

6.1.1 Full-grown Tree

library(rpart)
## Warning: package 'rpart' was built under R version 3.5.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.5.3
# Full grown tree
class.tree <- rpart(binned_score ~ . -imdb_score, data = train, method = "class")
## plot tree
prp(class.tree, type = 1, extra = 1, under = TRUE, split.font = 2, varlen = 0) 

Classification rules:

    1. If (user_vote >= 551000) then class = (8,10].
    1. If (83000 <= user_vote < 551000) then class = (6,8].
    1. If (user_vote < 83000) and (duration >= 106) then class = (6,8].
    1. If (user_vote < 83000) and (duration < 106) and (gross < 7900000) then class = (6,8].
    1. If (user_vote < 83000) and (duration < 106) and (gross >= 7900000) and (movie_fb < 4500) then class = (4,6].
    1. If (user_vote < 83000) and (duration < 106) and (gross >= 7900000) and (movie_fb >= 4500) and (year < 2000) then class = (6,8].
    1. If (user_vote < 83000) and (duration < 106) and (gross >= 7900000) and (movie_fb >= 4500) and (year >= 2000) and (critic_review_ratio < 1.2) then class = (4,6].
    1. If (user_vote < 41000) and (duration < 106) and (gross >= 7900000) and (movie_fb >= 4500) and (year >= 2000) and (critic_review_ratio >= 1.2) then class = (4,6].
    1. If (41000 <= user_vote < 83000) and (duration < 106) and (gross >= 7900000) and (movie_fb >= 4500) and (year >= 2000) and (critic_review_ratio >= 1.2) then class = (6,8].

From these rules, we can conclude that movies with a lot of votes in imdb website tend to have a higher score, which really makes sense because popular movies will have a lot of fans to vote high scores for them.

On the contrary, if a movie has fewer votes, it can still be a good movie if its duration is longer (rule #3).

It is surprise to see that movies make less profit are good, but ok if they make more profit (rule #4).

6.1.2 Best-pruned Tree

# cross-validation procedure
# argument cp sets the smallest value for the complexity parameter.
set.seed(51)
cv.ct <- rpart(binned_score ~ . -imdb_score, data = train, method = "class", 
               cp = 0.00001, minsplit = 5, xval = 5)
printcp(cv.ct)
## 
## Classification tree:
## rpart(formula = binned_score ~ . - imdb_score, data = train, 
##     method = "class", cp = 1e-05, minsplit = 5, xval = 5)
## 
## Variables actually used in tree construction:
##  [1] actor1_fb           budget              content            
##  [4] country             critic_review_ratio director_fb        
##  [7] duration            face_number         gross              
## [10] movie_fb            other_actors_fb     user_vote          
## [13] year               
## 
## Root node error: 798/2226 = 0.35849
## 
## n= 2226 
## 
##            CP nsplit rel error  xerror     xstd
## 1  0.06390977      0   1.00000 1.00000 0.028353
## 2  0.04636591      3   0.80827 0.86216 0.027322
## 3  0.01691729      4   0.76190 0.79574 0.026697
## 4  0.00751880      8   0.69424 0.77694 0.026503
## 5  0.00626566     10   0.67920 0.76316 0.026357
## 6  0.00563910     13   0.66040 0.75815 0.026303
## 7  0.00543024     15   0.64912 0.75815 0.026303
## 8  0.00501253     20   0.61278 0.75564 0.026276
## 9  0.00407268     25   0.58772 0.76817 0.026411
## 10 0.00375940     29   0.57143 0.77820 0.026517
## 11 0.00325815     45   0.50877 0.78070 0.026543
## 12 0.00322234     50   0.49248 0.78446 0.026582
## 13 0.00313283     57   0.46992 0.78446 0.026582
## 14 0.00292398     63   0.45113 0.79574 0.026697
## 15 0.00250627     66   0.44236 0.79574 0.026697
## 16 0.00219298    112   0.31830 0.80827 0.026821
## 17 0.00187970    120   0.29950 0.80451 0.026784
## 18 0.00167084    133   0.27444 0.80576 0.026797
## 19 0.00156642    142   0.25940 0.80576 0.026797
## 20 0.00125313    151   0.24436 0.83208 0.027049
## 21 0.00093985    200   0.18045 0.84712 0.027188
## 22 0.00083542    204   0.17669 0.84336 0.027154
## 23 0.00062657    207   0.17419 0.85714 0.027278
## 24 0.00050125    213   0.17043 0.85714 0.027278
## 25 0.00027847    218   0.16792 0.86842 0.027376
## 26 0.00025063    227   0.16541 0.86842 0.027376
## 27 0.00001000    237   0.16291 0.86842 0.027376

The 8th tree has the lowest cross-validation error (xerror): 0.75564.

# prune by lowest cp
pruned.ct <- prune(cv.ct, 
                   cp = cv.ct$cptable[which.min(cv.ct$cptable[,"xerror"]),"CP"])
length(pruned.ct$frame$var[pruned.ct$frame$var == "<leaf>"])
## [1] 21
prp(pruned.ct, type = 1, extra = 1, split.font = 1, varlen = -10)

6.1.3 Apply Model

# apply model on training set
tree.pred.train <- predict(pruned.ct, train, type = "class")
# generate confusion matrix for training data
confusionMatrix(tree.pred.train, train$binned_score)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction (0,4] (4,6] (6,8] (8,10]
##     (0,4]      0     0     0      0
##     (4,6]     45   378   115      0
##     (6,8]     22   266  1305     33
##     (8,10]     0     0     8     54
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7803          
##                  95% CI : (0.7625, 0.7974)
##     No Information Rate : 0.6415          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5228          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: (0,4] Class: (4,6] Class: (6,8] Class: (8,10]
## Sensitivity                0.0000       0.5870       0.9139       0.62069
## Specificity                1.0000       0.8989       0.5977       0.99626
## Pos Pred Value                NaN       0.7026       0.8026       0.87097
## Neg Pred Value             0.9699       0.8424       0.7950       0.98475
## Prevalence                 0.0301       0.2893       0.6415       0.03908
## Detection Rate             0.0000       0.1698       0.5863       0.02426
## Detection Prevalence       0.0000       0.2417       0.7305       0.02785
## Balanced Accuracy          0.5000       0.7429       0.7558       0.80847

Accuracy is 0.7803 for training set.

# apply model on validation set
tree.pred.valid <- predict(pruned.ct, valid, type = "class")
# generate confusion matrix for validation data
confusionMatrix(tree.pred.valid, valid$binned_score)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction (0,4] (4,6] (6,8] (8,10]
##     (0,4]      0     0     0      0
##     (4,6]      9    88    62      0
##     (6,8]      6   121   424     10
##     (8,10]     0     0     5     17
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7129          
##                  95% CI : (0.6789, 0.7453)
##     No Information Rate : 0.6617          
##     P-Value [Acc > NIR] : 0.001616        
##                                           
##                   Kappa : 0.345           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: (0,4] Class: (4,6] Class: (6,8] Class: (8,10]
## Sensitivity               0.00000       0.4211       0.8635       0.62963
## Specificity               1.00000       0.8668       0.4542       0.99301
## Pos Pred Value                NaN       0.5535       0.7558       0.77273
## Neg Pred Value            0.97978       0.7925       0.6298       0.98611
## Prevalence                0.02022       0.2817       0.6617       0.03639
## Detection Rate            0.00000       0.1186       0.5714       0.02291
## Detection Prevalence      0.00000       0.2143       0.7561       0.02965
## Balanced Accuracy         0.50000       0.6439       0.6589       0.81132

Accuracy is 0.7129 for validation set.

# apply model on test set
tree.pred.test <- predict(pruned.ct, test, type = "class")
# generate confusion matrix for test data
confusionMatrix(tree.pred.test, test$binned_score)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction (0,4] (4,6] (6,8] (8,10]
##     (0,4]      0     0     0      0
##     (4,6]      8   107    76      0
##     (6,8]      5   105   423     10
##     (8,10]     0     0     1      8
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7241         
##                  95% CI : (0.6904, 0.756)
##     No Information Rate : 0.6729         
##     P-Value [Acc > NIR] : 0.001485       
##                                          
##                   Kappa : 0.3651         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: (0,4] Class: (4,6] Class: (6,8] Class: (8,10]
## Sensitivity                0.0000       0.5047       0.8460       0.44444
## Specificity                1.0000       0.8418       0.5062       0.99862
## Pos Pred Value                NaN       0.5602       0.7790       0.88889
## Neg Pred Value             0.9825       0.8098       0.6150       0.98638
## Prevalence                 0.0175       0.2853       0.6729       0.02423
## Detection Rate             0.0000       0.1440       0.5693       0.01077
## Detection Prevalence       0.0000       0.2571       0.7308       0.01211
## Balanced Accuracy          0.5000       0.6733       0.6761       0.72153

Accuracy is 0.7241 for test set.

6.2 Random Forest

6.2.1 Build Model

library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.3
set.seed(53)
rf <- randomForest(binned_score ~ . -imdb_score, data = train, mtry = 5)
# Show model error
plot(rf)
legend('topright', colnames(rf$err.rate), col=1:5, fill=1:5)

The black line shows the overall error rate which falls below 30%. The red, green, blue and aqua lines show the error rate for bad, ok, good and excellent movies respectively. We can see that right now we’re much more successful predicting good movies. We cannot predict bad movies very well.

Let’s look at relative variable importance by plotting the mean decrease in Gini calculated across all trees.

# Get importance
importance <- importance(rf)
varImportance <- data.frame(Variables = row.names(importance), 
                            Importance = round(importance[ ,'MeanDecreaseGini'],2))

# Create a rank variable based on importance
rankImportance <- varImportance %>%
  mutate(Rank = paste0('#',dense_rank(desc(Importance))))

# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, Importance), 
                           y = Importance, fill = Importance)) +
  geom_bar(stat='identity') + 
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
            hjust=0, vjust=0.55, size = 4, colour = 'red') +
  labs(x = 'Variables') +
  coord_flip() + 
  theme_few()

From the plot, we see User_vote is a very important variable, while face_number, content and country are not so important.

6.2.2 Apply Model

set.seed(632)
# apply model on validation set
rf.pred.valid <- predict(rf, valid)
# generate confusion matrix for validation data
confusionMatrix(rf.pred.valid, valid$binned_score)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction (0,4] (4,6] (6,8] (8,10]
##     (0,4]      0     0     0      0
##     (4,6]      8   104    42      0
##     (6,8]      7   105   448     12
##     (8,10]     0     0     1     15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7642          
##                  95% CI : (0.7319, 0.7943)
##     No Information Rate : 0.6617          
##     P-Value [Acc > NIR] : 8.023e-10       
##                                           
##                   Kappa : 0.4523          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: (0,4] Class: (4,6] Class: (6,8] Class: (8,10]
## Sensitivity               0.00000       0.4976       0.9124       0.55556
## Specificity               1.00000       0.9062       0.5060       0.99860
## Pos Pred Value                NaN       0.6753       0.7832       0.93750
## Neg Pred Value            0.97978       0.8214       0.7471       0.98347
## Prevalence                0.02022       0.2817       0.6617       0.03639
## Detection Rate            0.00000       0.1402       0.6038       0.02022
## Detection Prevalence      0.00000       0.2075       0.7709       0.02156
## Balanced Accuracy         0.50000       0.7019       0.7092       0.77708

Accuracy is 0.7642 for validation set.

set.seed(633)
# apply model on test set
rf.pred.test <- predict(rf, test)
# generate confusion matrix for test data
confusionMatrix(rf.pred.test, test$binned_score)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction (0,4] (4,6] (6,8] (8,10]
##     (0,4]      0     0     0      0
##     (4,6]      7   115    49      0
##     (6,8]      6    97   449     10
##     (8,10]     0     0     2      8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7699          
##                  95% CI : (0.7379, 0.7997)
##     No Information Rate : 0.6729          
##     P-Value [Acc > NIR] : 4.201e-09       
##                                           
##                   Kappa : 0.4585          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: (0,4] Class: (4,6] Class: (6,8] Class: (8,10]
## Sensitivity                0.0000       0.5425       0.8980       0.44444
## Specificity                1.0000       0.8945       0.5350       0.99724
## Pos Pred Value                NaN       0.6725       0.7989       0.80000
## Neg Pred Value             0.9825       0.8304       0.7182       0.98636
## Prevalence                 0.0175       0.2853       0.6729       0.02423
## Detection Rate             0.0000       0.1548       0.6043       0.01077
## Detection Prevalence       0.0000       0.2301       0.7564       0.01346
## Balanced Accuracy          0.5000       0.7185       0.7165       0.72084

Accuracy is 0.7658 for test set.

7 Conclusion

Accuracy table for different models:

Dataset Decision Tree Random Forest
Training 0.7803
Validation 0.7129 0.7642
Test 0.7241 0.7658

For Decision tree model, we have a higher accuracy for training data because the tree was built based on the training data.

Based on the overall performance, we find that random forest is a better model, which gives a higher accuracy around 0.76.